Discovering relevant, but possibly hidden, variables is a
key step in constructing useful and predictive theories about
the natural world. This brief note explains the connections
between three approaches to this problem: the recently
introduced information-bottleneck method, the computational
mechanics approach to inferring optimal models, and Salmon's
statistical relevance basis.



I. INTRODUCTION 

Recently, Tishby, Pereira, and Bialek proposed a new
method for finding concise representations of the information
one set of variables contains about another [?].
This brief note explains the connections between this
"information-bottleneck" method and existing mathematical
frameworks and techniques. This comparison
should enhance the value of research in this promising
direction and clarify the relative uses of these techniques
in applications.

In the interest of space, we assume readers are familiar
with the notation of both [?] and [?].



II. THE INFORMATION-BOTTLENECK METHOD

In [?] the authors pose the following problem. Given
a joint distribution over two random variables, the "input"
$X$ and the "output" $Y$, find an intermediate or "bottleneck"
variable $\tilde{X}$ which is a (possibly stochastic) function
of $X$ such that $\tilde{X}$ is more compressed than $X$, but
retains predictive information about $Y$. More exactly,
they ask for a conditional distribution $\Pr(\tilde{x}|x)$ that
minimizes the functional



$$ T = I[\tilde{X};X] - \beta I[\tilde{X};Y] , \qquad (1) $$

where $I[W;Z]$ is the mutual information between random
variables $W$ and $Z$ [?] and $\beta$ is a positive real number.



Minimizing the first term represents the desire to find
a compression of the original input data $X$; maximizing
the second term represents the desire to retain the ability
to predict $Y$.¹ The coefficient $\beta$ governs the trade-off
between these two goals: as $\beta \to 0$, we lose interest in
prediction in favor of compression; whereas as $\beta \to \infty$,
predictive ability becomes paramount.
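To make the trade-off concrete, the functional of Eq. (1) can be evaluated numerically for a small joint distribution. The following is a minimal sketch, not taken from the works cited here: the function names, the toy table `p_xy`, and the two candidate encoders are all illustrative assumptions, and information is measured in bits.

```python
import numpy as np

def mutual_information(p_ab):
    """Mutual information I[A;B], in bits, of a joint probability table."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of A, shape (|A|, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of B, shape (1, |B|)
    product = p_a @ p_b                     # independent joint Pr(a) Pr(b)
    mask = p_ab > 0                         # 0 log 0 is taken to be 0
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / product[mask])))

def bottleneck_functional(p_xy, encoder, beta):
    """Evaluate I[Xt;X] - beta * I[Xt;Y] for encoder[x, t] = Pr(t | x)."""
    p_xy = np.asarray(p_xy, dtype=float)
    encoder = np.asarray(encoder, dtype=float)
    p_x = p_xy.sum(axis=1)
    p_x_xt = p_x[:, None] * encoder   # joint of the input and the bottleneck
    p_xt_y = encoder.T @ p_xy         # joint of the bottleneck and the output
    return mutual_information(p_x_xt) - beta * mutual_information(p_xt_y)

# Hypothetical toy table: three inputs (rows), two outputs (columns).
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.05],
                 [0.10, 0.30]])
identity = np.eye(3)         # no compression: the bottleneck copies X
constant = np.ones((3, 1))   # total compression: a single bottleneck state

for beta in (1.0, 10.0):
    print(beta,
          bottleneck_functional(p_xy, identity, beta),
          bottleneck_functional(p_xy, constant, beta))
```

For this particular table the single-state encoder scores lower at the smaller value of $\beta$ and the identity encoder scores lower at the larger one, which is the qualitative behavior described above.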

Extending classical rate-distortion theory, the authors
not only state self-consistent equations that
determine which distributions satisfy this variational
problem, but also give a convergent iterative procedure that
finds one of these distributions. They do not address the
rate of convergence.
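The iterative procedure is not reproduced in this note. As a rough sketch of how such an iteration is commonly written, the code below alternates between updating the encoder $\Pr(\tilde{x}|x)$, the marginal $\Pr(\tilde{x})$, and the decoder $\Pr(y|\tilde{x})$, with the encoder weighted by $\exp(-\beta D_{KL}[\Pr(y|x) \| \Pr(y|\tilde{x})])$. The function name, the random initialization, the fixed iteration count, and the small constant `tiny` used to avoid taking the logarithm of zero are assumptions made for illustration.

```python
import numpy as np

def ib_iterate(p_xy, n_states, beta, n_iter=200, seed=0):
    """Sketch of an iterative bottleneck update; returns enc[x, t] = Pr(t | x)."""
    rng = np.random.default_rng(seed)
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)                  # Pr(x), assumed positive for every x
    p_y_given_x = p_xy / p_x[:, None]       # Pr(y | x)
    tiny = 1e-12                            # assumed guard against log(0)
    enc = rng.random((len(p_x), n_states))  # random stochastic initial encoder
    enc /= enc.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        p_t = p_x @ enc                                     # Pr(t)
        p_y_given_t = (enc * p_x[:, None]).T @ p_y_given_x  # sum_x Pr(t|x) Pr(x) Pr(y|x)
        p_y_given_t /= np.maximum(p_t, tiny)[:, None]       # Pr(y | t)
        # kl[x, t] = D_KL[ Pr(y|x) || Pr(y|t) ], in nats
        log_ratio = (np.log(np.maximum(p_y_given_x, tiny))[:, None, :]
                     - np.log(np.maximum(p_y_given_t, tiny))[None, :, :])
        kl = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=2)
        logits = np.log(np.maximum(p_t, tiny))[None, :] - beta * kl
        logits -= logits.max(axis=1, keepdims=True)         # numerical stability
        enc = np.exp(logits)
        enc /= enc.sum(axis=1, keepdims=True)               # normalize over t
    return enc

# Hypothetical toy table: inputs 0 and 1 share the same Pr(y | x).
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.05],
                 [0.10, 0.30]])
enc = ib_iterate(p_xy, n_states=2, beta=10.0)
print(np.round(enc, 3))
```

Because each update depends on $x$ only through $\Pr(y|x)$, inputs with identical conditional output distributions receive identical encoder rows in this sketch. Which fixed point the iteration reaches can depend on the initialization.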



III. CAUSAL STATES FOR TRANSDUCER-FUNCTIONALS

In [?] and earlier publications, we defined causal states
for stationary stochastic processes as follows. Two histories
$\overleftarrow{s}$ and $\overleftarrow{s}'$ belong to the same
causal state $\mathcal{S}$ if and only if

$$ \Pr(\overrightarrow{S}^L = \overrightarrow{s}^L \mid \overleftarrow{S} = \overleftarrow{s}) = \Pr(\overrightarrow{S}^L = \overrightarrow{s}^L \mid \overleftarrow{S} = \overleftarrow{s}') , \qquad (2) $$

for all $\overrightarrow{s}^L$ and for all $L$. That is, two histories
belong to the same causal state if and only if they give the same
conditional distribution for futures. In [?] we showed that
the causal states defined by Eq. (2) possess two kinds of
optimality. First, their ability to predict the future
$\overrightarrow{S}$ is maximal. Second, they are the simplest
such set of states.

Thus, we showed that the set $\boldsymbol{\mathcal{S}}$ of causal
states is the solution to the following optimization problem. Given
the joint distribution $\Pr(\overleftarrow{S}, \overrightarrow{S})$
over the past $\overleftarrow{S}$ and future $\overrightarrow{S}$,
find the function (equivalently, partition) $\epsilon$ of
$\overleftarrow{S}$ such that (i) the conditional entropy
$H[\overrightarrow{S}^L | \epsilon(\overleftarrow{S})]$ is minimized
for all $L$, and (ii) the entropy $H[\epsilon(\overleftarrow{S})]$
is minimized among all functions $\hat{\eta}$ that satisfy
condition (i).² The equivalence classes induced by $\epsilon$ are
the causal states, and they are the unique³ solution to the
optimization problem. A moment's reflection shows that this
optimization is equivalent to first maximizing the mutual
information between the effective states and the futures
and then minimizing the mutual information between the
effective states and the histories. The first step maximizes
the predictive ability of the effective states and the
second selects the most concise set of states. Note that
the opposite sequence of optimizations (first minimizing
complexity and then maximizing predictability) is trivial,
since it produces a single-state model that describes
an IID sequence of random variables; e.g., a biased coin
or die.

1 Since $\tilde{X} = g(X, Q)$ for some auxiliary random variable $Q$, a
theorem of Shannon's assures us that $I[\tilde{X};Y] \le I[X;Y]$ and
the transformation from $X$ to $\tilde{X}$ cannot increase our ability
to predict $Y$ [?, App. 7].
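The equivalence between the two formulations of the optimization can be spelled out in two lines. This is a sketch under two assumptions that the text leaves implicit: the effective states are a deterministic function of the history, and the entropy of the length-$L$ future is finite and does not depend on the choice of partition.

```latex
\begin{align*}
I[\overrightarrow{S}^L ; \epsilon(\overleftarrow{S})]
  &= H[\overrightarrow{S}^L]
   - H[\overrightarrow{S}^L \mid \epsilon(\overleftarrow{S})] , \\
I[\epsilon(\overleftarrow{S}) ; \overleftarrow{S}]
  &= H[\epsilon(\overleftarrow{S})]
   - H[\epsilon(\overleftarrow{S}) \mid \overleftarrow{S}]
   = H[\epsilon(\overleftarrow{S})] .
\end{align*}
```

The first line shows that minimizing the conditional entropy of the future is the same as maximizing the mutual information between states and futures. The second shows that, for a deterministic partition, minimizing the entropy of the states is the same as minimizing their mutual information with the histories.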



IV. MEMORYLESS TRANSDUCERS 

As has been remarked earlier (e.g., in [?] and [?]), the
causal-state construction is not intrinsically limited to
time series. Of particular interest here is the case of
transducer-functionals: when one sequence of variables
is a functional of another sequence, possibly a stochastic
functional. In this case, one can construct causal states
that (i) retain all predictive information about the output
series, (ii) are deterministic functions of the prior causal
state and the most recent value of the input series, and
(iii) minimize the statistical complexity of (information
stored in) the causal states. We present the theory of
causal states for general transducers elsewhere.

In the case where the output depends on the current
input alone (the case of memoryless transduction) the
causal states assume a particularly simple form: two inputs
$x$ and $x'$ belong to the same causal state if and only if

$$ \Pr(Y = y | X = x) = \Pr(Y = y | X = x') , \qquad (3) $$

for all $y$.⁴ In this case, it can be shown that

$$ H[Y | \epsilon(X)] \le H[Y | \eta(X)] , \qquad (4) $$

for any other partition $\eta$ of the inputs and that, among all
the rival partitions $\hat{\eta}$ minimizing the conditional entropy
of the outputs (the prescient rivals of $\epsilon$),

$$ H[\epsilon(X)] \le H[\hat{\eta}(X)] . \qquad (5) $$

This is to say, the causal states are the most compressed
hidden variables. In the sense of [?], they are optimal
bottleneck variables.

2 The states induced by the partition $\hat{\eta}$ are called the
prescient rivals of the causal states induced by $\epsilon$. The entropy
$H[\epsilon(\overleftarrow{S})]$ is called the statistical complexity
and measures the "size" of, or amount of information stored in, the
causal states.

3 More precisely, any other function satisfying conditions (i)
and (ii) may differ from $\epsilon$ on at most a set of histories of
measure zero.

4 A somewhat more complicated construction is necessary
when the transducer exhibits memory.
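For a finite joint table, the partition defined by Eq. (3) and the inequalities (4) and (5) can be checked directly. The sketch below is illustrative only: the function names, the toy table, and the rounding tolerance used to decide when two conditional distributions count as equal are assumptions, not part of the construction itself.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zeros ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def causal_partition(p_xy, decimals=10):
    """Label each input x so that inputs with equal Pr(Y | X = x) share a label."""
    p_xy = np.asarray(p_xy, dtype=float)
    labels, seen = [], {}
    for row in p_xy:
        # Conditional distribution of Y given this input, rounded so that
        # floating-point noise does not split a state (assumed tolerance).
        key = tuple(np.round(row / row.sum(), decimals))
        labels.append(seen.setdefault(key, len(seen)))
    return labels

def partition_entropies(p_xy, labels):
    """Return (H[f(X)], H[Y | f(X)]) for the partition f(x) = labels[x]."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_sy = np.zeros((max(labels) + 1, p_xy.shape[1]))
    for x, s in enumerate(labels):
        p_sy[s] += p_xy[x]              # joint Pr(f(X) = s, Y = y)
    h_state = entropy(p_sy.sum(axis=1))
    return h_state, entropy(p_sy) - h_state

# Hypothetical toy table: inputs 0 and 1 have the same Pr(y | x).
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.05],
                 [0.10, 0.30]])
eps = causal_partition(p_xy)                          # causal-state labels
print(eps)
print(partition_entropies(p_xy, eps))                 # causal states
print(partition_entropies(p_xy, [0, 1, 2]))           # identity partition, a prescient rival
print(partition_entropies(p_xy, [0, 0, 0]))           # single state, not prescient
```

In this example the identity partition predicts the output exactly as well as the causal states but has a larger state entropy, while the single-state partition has zero state entropy but a larger conditional entropy for the output.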

One concludes that these are precisely what should be
delivered by the information-bottleneck method in the
limit where $\beta \to \infty$. It is not immediately obvious that
the iterative procedure of [?] is still valid in this limit.
Nonetheless, that $\epsilon$ is the partition satisfying their
original constraints is evident.

We note in passing that, as shown in [?], prescient
rivals (those sets of states that retain all predictive
information from the original inputs while compressing
them) are sufficient statistics. Conversely, [?] states
that, when sufficient statistics exist,
compression-with-prediction is possible.

V. THE STATISTICAL RELEVANCE BASIS 

Before closing, we point out another solution to the
problem of discovering concise and predictive hidden
variables. In his books [?] and [?], Salmon put forward
a construction, under the name of the "statistical relevance
basis", that is identical in its essentials with that of
causal states for memoryless transducers.⁵ Owing to the
rather different aims for which Salmon's construction was
intended (explicating the notion of "causation" in the
philosophy of science), no one seems to have proved its
information-theoretic optimality properties nor even to
have noted its connection to sufficient statistics. (Briefly:
if a nontrivial sufficient partition of the input variables
exists, then the relevance basis is the minimal sufficient
partition. These proofs will appear elsewhere.)

VI. COMPARISON AND CONCLUSION 

Recapitulating, the causal states for memoryless transduction
coincide with the cells of the "bottleneck" partition
in the limit $\beta \to \infty$. Moreover, both are identical
with the statistical relevance basis of Salmon.

The construction of the causal states does not allow us
to discard any predictive information about the output
$Y$, even if this might allow for a substantial reduction
in the statistical complexity. The bottleneck method, by
contrast, generally throws away some predictive information.
It trades one bit less statistical complexity for
$1/\beta$ bits less predictive information. Of course, the linear
trade-off and the particular value of the coefficient
$\beta$ that controls it are ad hoc choices. Whether this is
acceptable in applications would seem to depend on the
goal. For example, if the goal is a practical "lossy"
data-compression scheme, the bottleneck method recommends
itself. However, if the goal is representing the intrinsic
computation or causal structure of some natural process,
causal states are better suited to the task.

5 Salmon's work only came to our attention in mid-1998, and
thus it is not cited in our publications including and prior to [?].
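The exchange rate of one bit of statistical complexity for $1/\beta$ bits of predictive information follows from the functional of Eq. (1). As a sketch, suppose that moving from one candidate bottleneck variable to another changes the two mutual informations by $\Delta I[\tilde{X};X]$ and $\Delta I[\tilde{X};Y]$ (the $\Delta$ notation is introduced here only for illustration):

```latex
\begin{align*}
\Delta T &= \Delta I[\tilde{X};X] - \beta\, \Delta I[\tilde{X};Y] , \\
\Delta T \le 0 \quad &\Longleftrightarrow \quad
  -\Delta I[\tilde{X};Y] \le \frac{1}{\beta}
  \left( -\Delta I[\tilde{X};X] \right) .
\end{align*}
```

That is, giving up predictive information lowers the functional only if the loss is at most $1/\beta$ times the reduction in $I[\tilde{X};X]$. In the limit $\beta \to \infty$ no loss of predictive information is tolerated, which is the regime of the causal states.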